The Questions You Ask
Acronyms expanded in this post:
- AI: Artificial Intelligence. Software that generates, classifies, predicts, summarizes, or acts on patterns in data.
- API: Application Programming Interface. A controlled doorway through which software systems exchange data or actions.
- CDA: Clinical Document Architecture. An older Health Level Seven standard for structured clinical documents.
- CDISC: Clinical Data Interchange Standards Consortium. A standards body for clinical research data.
- CDMS: Clinical Data Management System. The system for collecting, cleaning, and managing research study data.
- CPT: Current Procedural Terminology. A United States coding system for medical procedures and services.
- CTMS: Clinical Trial Management System. The operational system for managing trial logistics.
- EHR: Electronic Health Record. The clinical system where patient care is documented and managed.
- ETL: Extract, Transform, Load. A data pipeline pattern for pulling, reshaping, and loading data.
- FHIR: Fast Healthcare Interoperability Resources. The modern, web-friendly Health Level Seven standard for healthcare data exchange.
- HL7: Health Level Seven. The family of healthcare messaging and data exchange standards.
- HL7 v2: Health Level Seven version 2. The older event-message standard that still runs much of hospital integration.
- ICD: International Classification of Diseases. A diagnosis classification system used for reporting, billing, and statistics.
- IT: Information Technology. The practice of building, operating, and supporting computing systems.
- LOINC: Logical Observation Identifiers Names and Codes. A terminology for laboratory tests and clinical observations.
- SDTM: Study Data Tabulation Model. A CDISC model used to submit standardized clinical trial data.
- US: United States. The United States of America.
- UTHSCSA: University of Texas Health Science Center at San Antonio. A health science university and research institution in Texas.
- VA: Veterans Affairs. The United States federal healthcare system serving military veterans.
The first mistake in healthcare analytics is thinking that a better question produces a better answer. It may. It may also produce a more beautifully laminated falsehood, suitable for PowerPoint, executive reassurance, and later disaster. In UTHSCSA and VA settings, the same sentence can mean two different things because the data was born under different weather. One came from academic medicine, with research, clinics, grants, departments, human improvisation, and enough committees to populate a tram depot. The other came from the VA, a vast federal healthcare organism with decades of patient continuity, strong internal identity, old architecture, and the peculiar dignity of a system that has survived because people kept repairing it while pretending it was not always on fire.
Ask, “What happened to this diabetic patient over ten years?” In UTHSCSA, the question usually begins as archaeology. The patient may have been seen in one clinic, referred to another, vanished into an outside hospital, returned with a PDF, joined a study, left the study, had labs done elsewhere, and reappeared five years later as if healthcare were a Bengali serial with missing episodes and a reincarnated uncle. The analyst reconstructs the story from EHR records, research databases, claims fragments, registries, lab feeds, and sometimes a spreadsheet whose name suggests innocence but whose contents have the moral stability of a fish market at 2 p.m.
Ask the same question in the VA and the floor feels different. The patient has often remained inside a national system for years. The identifiers are stronger. The visits are visible. The medications, labs, diagnoses, and encounters may stretch across time like railway tracks. Not clean tracks. Not bullet-train tracks. More like old Indian Railways tracks, slightly bent, patched, cursed at by generations, yet somehow carrying the train. That continuity is precious. It lets you ask longitudinal questions with seriousness. But seriousness is not the same as truth. The VA can show you what the system recorded over time. It cannot automatically show you what the disease meant, what the clinician intended, what the patient understood, or what was quietly worked around outside the formal record.
That is the main architectural point. Data transport is not semantic meaning. HL7 v2 can move an observation from a lab system to an EHR. FHIR can expose an Observation resource through an API. Neither one guarantees that the receiving system, the warehouse, the analyst, the model, and the clinician all understand the observation in the same way. A message can travel perfectly and still arrive as a little philosophical corpse. The packet got there. The meaning did not.
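A minimal sketch in Python makes the corpse concrete. Both payloads are invented rather than lifted from any real interface, but their shapes are faithful to the standards:

```python
# The same lab result, once as an HL7 v2 OBX segment and once as a FHIR
# Observation. Both payloads are illustrative, not from a real interface.

hl7_v2_obx = "OBX|1|NM|2345-7^Glucose^LN||108|mg/dL|70-99|H|||F"

fhir_observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org",
                         "code": "2345-7", "display": "Glucose"}]},
    "valueQuantity": {"value": 108, "unit": "mg/dL"},
}

# Transport succeeds in both cases: the bytes arrive, the structures parse.
# Meaning is still open: fasting or random glucose? Reviewed by a clinician,
# or "final" only in the lab system's sense? Neither standard answers that.
print(hl7_v2_obx.split("|")[5], fhir_observation["valueQuantity"]["value"])
```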
This matters brutally for longitudinal analysis. In UTHSCSA, the longitudinal record is often assembled from broken furniture. You align dates, normalize codes, reconcile patient identities, guess whether missing data means “not done,” “done elsewhere,” “not loaded yet,” or “trapped forever in a departmental database maintained by one heroic person named Linda who retired in 2018.” Temporal order becomes slippery. Was the lab before the diagnosis? Was the medication started before the adverse event? Did the patient fail therapy, or did the pharmacy feed fail to arrive? The data warehouse may present neat rows, but underneath, the timeline is a sari folded by someone in a hurry.
In the VA, longitudinal analysis has more native muscle. You can see years of labs. You can follow medication trajectories. You can examine disease progression across sites. But the trap changes shape. Because the data is abundant, people trust it too quickly. A diagnosis that persists for years may be clinically meaningful, or it may be a problem-list fossil, carried forward like an ancestral brass pot nobody uses but nobody dares throw away. A medication may appear discontinued because the order expired, because the patient stopped taking it, because care moved elsewhere, or because the workflow encoded silence as fact. Longitudinal depth gives you continuity, not innocence.
Population insight is where the comedy becomes statistical and the statistics become political. In UTHSCSA, defining a population is often the hardest part. “All patients with chronic kidney disease” sounds simple until you ask where the kidney disease is hiding. Is it in ICD codes? Lab-derived estimated glomerular filtration rate [eGFR, a calculated measure of kidney function]? Nephrology visits? Registry membership? Research enrollment? Medication patterns? A physician note? A billing claim? Each definition catches a different animal. One catches the formally coded patient. One catches the physiologically abnormal patient. One catches the patient visible to a department. One catches the patient useful to a grant.
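Here is the menagerie as a small Python sketch. The rows and field names are invented for illustration, not drawn from any UTHSCSA schema:

```python
# Three defensible definitions of a "chronic kidney disease" cohort,
# each catching a different set of patients. All data is invented.

diagnoses = [{"patient_id": 1, "icd_code": "N18.3"},
             {"patient_id": 2, "icd_code": "E11.9"}]
labs = [{"patient_id": 2, "test": "eGFR", "value": 48},
        {"patient_id": 3, "test": "eGFR", "value": 55}]
encounters = [{"patient_id": 3, "clinic": "nephrology"}]

coded = {r["patient_id"] for r in diagnoses if r["icd_code"].startswith("N18")}
abnormal = {r["patient_id"] for r in labs
            if r["test"] == "eGFR" and r["value"] < 60}
referred = {r["patient_id"] for r in encounters if r["clinic"] == "nephrology"}

# Three definitions, three different animals: {1}, {2, 3}, and {3}.
# Which union or intersection counts as "the CKD population" is an
# analytic decision, not a fact sitting in the data.
print(coded, abnormal, referred, coded | abnormal | referred)
```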
This is why representation failures are so often mislabeled as data quality failures. People say, “The data is dirty,” with the weary superiority of someone discovering mud during monsoon. But often the data is not dirty in the simple sense. It is representing something different from what the analyst thinks it represents. A CPT code may represent a billable procedure, not a complete clinical reality. An ICD code may represent documentation sufficient for billing or reporting, not the full diagnostic reasoning. A lab value may be accurate, but disconnected from the clinical context that explains why it was ordered. The failure is not that the data is bad. The failure is that the data is answering a question it was never asked.
In the VA, population insight gets one great advantage: scale with continuity. You can identify large cohorts. You can follow them nationally. You can study patterns over long periods. For chronic disease, mental health, pharmacy use, readmission, multimorbidity, aging, and care fragmentation, that is gold. Not shiny corporate brochure gold. More like old mine gold, mixed with rock, sweat, and paperwork. But VA populations are not everyone. Veterans differ from the general population in sex distribution, age, exposure history, socioeconomic pattern, comorbidity burden, and the way care is accessed. A model or insight that works beautifully in the VA may wobble badly outside it. The denominator is stronger, but the world outside the denominator is still standing there, smoking a bidi and refusing to disappear.
Predictive modeling is where everyone becomes overconfident because the graph goes upward and the area under the curve looks respectable. In UTHSCSA settings, predictive models often begin with fragmented data. The modeler builds features from whatever can be reliably obtained: labs, visits, diagnoses, medications, demographics, maybe notes, maybe research variables, maybe claims. But the data may be cross-sectional when the disease is longitudinal. It may be clinic-heavy when the patient’s life is not. It may be research-clean inside a protocol and swampy outside it. The model learns the shape of available data, not the shape of reality.
In the VA, predictive modeling can use richer temporal features. That is a real advantage. You can model slopes in lab values, medication persistence, encounter frequency, comorbidity accumulation, and prior utilization. You can ask whether a patient’s kidney function is drifting downward, whether diabetes control is worsening, whether admissions cluster after medication changes, whether social risk shows up indirectly through missed appointments and travel burden. But the model still learns from the system’s habits. It may predict hospitalization partly because it has learned who gets watched, who gets coded, who gets referred, and who falls into the warm circle of documentation. The machine does not discover the patient like a saint with a stethoscope. It discovers the institutional shadow of the patient.
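As one concrete instance of those temporal features, here is a hedged Python sketch of an eGFR slope over invented values. A production version would also have to wrestle with unit changes, assay migrations, and outside-care gaps:

```python
# Ordinary least-squares slope of a patient's eGFR over time, in units
# per year. The series is invented for illustration.
from datetime import date

egfr_series = [(date(2018, 3, 1), 72), (date(2019, 4, 15), 66),
               (date(2021, 1, 10), 58), (date(2023, 6, 2), 51)]

xs = [(d - egfr_series[0][0]).days / 365.25 for d, _ in egfr_series]
ys = [v for _, v in egfr_series]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
print(f"eGFR drift: {slope:.1f} per year")  # roughly -4 per year here

# The feature is only as honest as its gaps: a two-year hole may mean
# stability, outside care, or a feed that silently stopped.
```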
This is the non-obvious architectural insight: different healthcare systems do not merely store different data; they manufacture different kinds of patients. The UTHSCSA patient is often a composite patient, assembled across care, research, department, and referral boundaries. The VA patient is often a longitudinal institutional patient, visible through continuity but shaped by federal workflows, legacy structures, and veteran-specific care patterns. Neither is fake. Neither is complete. Each is a model of a person built by an organization that had other things to do that morning.
So the smart architect does not begin with, “Can we get the data?” That is a junior question wearing senior shoes. The better question is, “What kind of truth can this system support without lying?” For longitudinal analysis, the practical design implication is to preserve provenance aggressively. Every important fact should carry its source, timestamp, transformation history, terminology mapping, and confidence context where possible. Not because architects enjoy making life miserable, though some do, but because time without provenance is gossip in database form.
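Sketched in Python, with field names that are illustrative rather than any standard provenance model, the principle looks roughly like this:

```python
# Every analytic fact travels with its origin story. Field names are
# invented for illustration, not an existing UTHSCSA or VA schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    patient_id: str
    kind: str                # e.g. "lab", "diagnosis", "medication"
    value: str
    effective_time: str      # when the event happened clinically
    recorded_time: str       # when the system first saw it
    source_system: str       # which feed or database produced it
    terminology_map: str     # e.g. "local code -> LOINC 2345-7, map v3"
    transforms: tuple = ()   # ETL steps applied, in order

glucose = Fact(
    patient_id="p-001", kind="lab", value="108 mg/dL",
    effective_time="2023-06-02T08:14", recorded_time="2023-06-03T01:05",
    source_system="lab_feed_A",
    terminology_map="GLU-S -> LOINC 2345-7, map v3",
    transforms=("unit_normalize_v2", "dedupe_v1"),
)
# Without these fields, "the glucose was 108" is gossip. With them, it
# is a claim you can audit.
```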
For population insight, cohort definitions must be versioned like code. You do not define “heart failure population” once and then wander away proudly. You define it using diagnosis codes, medication patterns, ejection fraction values, encounter types, and exclusion logic, then you record the definition, test it, revise it, and state what it misses. The same applies to diabetes, kidney disease, sepsis, depression, readmission risk, trial eligibility, and all the other grand nouns that behave nicely in meetings and badly in production.
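One way to make a cohort definition behave like code, sketched with standard ICD-10 and ejection fraction conventions but otherwise invented details:

```python
# A named, versioned cohort definition with explicit inclusion logic,
# a statement of what it misses, and a change log. Codes are standard
# ICD-10; everything else is illustrative.
HEART_FAILURE_COHORT = {
    "name": "heart_failure",
    "version": "2.1.0",
    "include": {
        "icd10_prefixes": ["I50"],
        "medications": ["sacubitril/valsartan", "furosemide"],
        "ejection_fraction_max": 40,
        "encounter_types": ["inpatient", "cardiology_clinic"],
    },
    "exclude": {"icd10_prefixes": ["I50.9"]},  # unspecified HF, per review
    "known_misses": "outside-care patients; HF documented only in notes",
    "changed_from_2.0": "added EF criterion; removed claims-only inclusion",
}
```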
For predictive modeling, the first validation should not be accuracy. It should be representational sanity. Are the features clinically meaningful? Are they artifacts of workflow? Does the model depend on variables that appear only after the outcome is already unfolding? Does missingness mean absence, delayed loading, outside care, patient nonadherence, or system blindness? A model can be mathematically impressive and architecturally foolish. This is not a rare species. It breeds in dashboards.
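A minimal sanity pass, in Python with invented feature rows, run before anyone admires the area under the curve:

```python
# Two representational checks before accuracy: does any feature postdate
# the outcome, and what does its missingness actually mean?
features = [
    {"name": "egfr_slope", "last_observed": "2022-11-01",
     "missing_rate": 0.34},
    {"name": "palliative_consult", "last_observed": "2023-02-10",
     "missing_rate": 0.97},
]
outcome_time = "2023-01-15"  # e.g. hospitalization date

for f in features:
    if f["last_observed"] > outcome_time:  # ISO dates compare as strings
        print(f"{f['name']}: observed after the outcome -- leakage risk")
    if f["missing_rate"] > 0.9:
        print(f"{f['name']}: mostly missing -- absence, outside care, "
              "or system blindness? Decide before modeling.")
```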
The clean solution, naturally, does not exist. UTHSCSA cannot magically become a closed national system with perfect continuity. The VA cannot simply shed decades of legacy semantics and wake up as a modern FHIR-native paradise with mangoes falling from the trees. Regulations constrain access. Procurement freezes strange choices into stone. Departments protect their systems. Research protocols narrow data definitions. Billing distorts clinical representation. Legacy interfaces keep running because replacing them would risk the hospital’s daily metabolism. The hospital may dislike its old plumbing, but it still needs water at 6 a.m.
FHIR helps. It gives us a cleaner grammar for exchange. But grammar is not literature. FHIR resources can carry structured data more elegantly than older patterns, but they do not automatically resolve whether a diagnosis is active, suspected, historical, billing-driven, clinician-confirmed, patient-reported, or copied forward by a tired resident at midnight. HL7 v2 is often mocked as old, but it continues to move the blood of healthcare operations. CDA can preserve narrative richness while frustrating computability. CDISC and SDTM can make research submissions consistent, but consistency inside a study is not the same as meaning across clinical life. Standards reduce chaos. They do not abolish interpretation.
The practical path is humbler and more durable. Ask different questions in different settings. In UTHSCSA, ask how much reconstruction the answer requires. Ask which systems are missing. Ask whether the cohort is a clinical population, a research population, or merely a population visible to the data platform. Ask how departmental ownership has shaped the record. Ask whether a beautiful analytic result is just the echo of who got captured.
In the VA, ask what continuity hides. Ask whether long records contain stale problems, inherited assumptions, legacy code meanings, and workflow fossils. Ask whether national scale is creating confidence faster than interpretation. Ask whether the veteran population supports the inference being made. Ask what changed when systems migrated, when coding practices shifted, when terminology mappings were introduced, and when a local workaround quietly became unofficial architecture.
That last phrase matters. Human workarounds become architecture. A nurse’s spreadsheet, a research coordinator’s tracker, a local interface patch, a clinic’s habit of coding one way rather than another—these are not side stories. They are the little bamboo bridges by which healthcare crosses its own swamp. Later, when analysts arrive, the bridges are invisible. Only the data remains, looking official, wearing shoes, pretending it was born in a cathedral.
So yes, ask about longitudinal analysis. Ask about population insight. Ask about predictive modeling. But ask first what the system thinks a patient is, what it thinks an event is, what it thinks time is, what it refuses to know, and what it has been forced to remember because reimbursement, regulation, research, or operational survival demanded it.
In healthcare data, questions are not innocent. They are instruments. Point one at UTHSCSA and it behaves like a lantern in a crowded academic bazaar, showing fragments, specialists, studies, referrals, and clever local repairs. Point one at the VA and it behaves like a miner’s lamp in a deep institutional tunnel, showing continuity, scale, history, and old marks on the wall whose meanings have partly survived and partly evaporated.
The job of the architect is not to worship either lamp. It is to know what kind of darkness each one fails to illuminate.